Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mariia Dubyk
November 1, 2022
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
sheet = "Table 2",
col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))
marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are “Yes”, “No”, “Not clear” and “No response”. The table also shows percentage of each answer and total numbers.
To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.
marriage <- marriage%>%
mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
# A tibble: 155 × 6
Division Yes No Response_not_clear Non_responding StateTerritory
<chr> <chr> <chr> <chr> <chr> <chr>
1 Banks 37736 46343 247 20928 New South Wales Divi…
2 Barton 37153 47984 226 24008 New South Wales Divi…
3 Bennelong 42943 43215 244 19973 New South Wales Divi…
4 Berowra 48471 40369 212 16038 New South Wales Divi…
5 Blaxland 20406 57926 220 25883 New South Wales Divi…
6 Bradfield 53681 34927 202 17261 New South Wales Divi…
7 Calare 54091 35779 285 25342 New South Wales Divi…
8 Chifley 32871 46702 263 28180 New South Wales Divi…
9 Cook 47505 38804 229 18713 New South Wales Divi…
10 Cowper 57493 38317 315 25197 New South Wales Divi…
# … with 145 more rows
In the final dataframe we will have 4 rows (with answers “Yes”, “No”,“Response_not_clear”, “Non_responding”) for each division.
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 363. 847.
2 USA 1990 NAFTA 1299. 367.
3 France 1980 EU 958. 761.
4 Mexico 1990 NAFTA 1825. 635.
5 USA 1980 NAFTA 882. 1407.
6 France 1990 EU 1443. 1994.
[1] 6
[1] 5
[1] 12
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Any additional comments?
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 363.
2 Mexico 1980 NAFTA incoming 847.
3 USA 1990 NAFTA outgoing 1299.
4 USA 1990 NAFTA incoming 367.
5 France 1980 EU outgoing 958.
6 France 1980 EU incoming 761.
7 Mexico 1990 NAFTA outgoing 1825.
8 Mexico 1990 NAFTA incoming 635.
9 USA 1980 NAFTA outgoing 882.
10 USA 1980 NAFTA incoming 1407.
11 France 1990 EU outgoing 1443.
12 France 1990 EU incoming 1994.
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.
# A tibble: 620 × 4
Division StateTerritory Response_type Number
<chr> <chr> <chr> <chr>
1 Banks New South Wales Divisions Yes 37736
2 Banks New South Wales Divisions No 46343
3 Banks New South Wales Divisions Response_not_clear 247
4 Banks New South Wales Divisions Non_responding 20928
5 Barton New South Wales Divisions Yes 37153
6 Barton New South Wales Divisions No 47984
7 Barton New South Wales Divisions Response_not_clear 226
8 Barton New South Wales Divisions Non_responding 24008
9 Bennelong New South Wales Divisions Yes 42943
10 Bennelong New South Wales Divisions No 43215
# … with 610 more rows
---
title: "Challenge 3"
author: "Mariia Dubyk"
desription: "Tidy Data: Pivoting"
date: "11/01/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
```{r}
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
sheet = "Table 2",
col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))
marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
```
### Briefly describe the data
The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are "Yes", "No", "Not clear" and "No response". The table also shows percentage of each answer and total numbers.
To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.
```{r}
marriage <- marriage%>%
mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
```
## Anticipate the End Result
In the final dataframe we will have 4 rows (with answers "Yes", "No","Response_not_clear", "Non_responding") for each division.
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
```{r}
```
Any additional comments?
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.
```{r}
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
names_to="Response_type",
values_to = "Number")
marriage_new
```